

Independent Policy Gradient Methods for Competitive Reinforcement Learning

Neural Information Processing Systems

Minimax value. Shapley [63] showed that for any zero-sum stochastic game $G$, there exists a pair of policies $(\pi_1^\star, \pi_2^\star)$ such that
$$V(\pi_1^\star, \pi_2) \le V(\pi_1^\star, \pi_2^\star) \le V(\pi_1, \pi_2^\star) \quad \text{for all } \pi_1, \pi_2, \tag{1}$$
and in particular $V^\star = \min_{\pi_1}\max_{\pi_2} V(\pi_1, \pi_2) = \max_{\pi_2}\min_{\pi_1} V(\pi_1, \pi_2)$. The crux of the analysis is that the $x$-player operates at a slower timescale than the $y$-player. Compared to [43], which establishes a gradient dominance property for the $y$-player, the argument here tracks the $y$-player's objective $\bar f_t(y) := f(x_t, y)$ induced by the slow iterate $x_t$, then averages over iterates.
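For intuition, relation (1) can be checked numerically in the single-state (matrix game) special case, where $V(x, y) = x^\top A y$. The sketch below is illustrative only, not from the paper: the matching-pennies matrix A and the helper names are our own. It solves both sides of the minimax identity as linear programs with scipy and confirms they agree.

import numpy as np
from scipy.optimize import linprog

def min_max_value(A):
    # Min-player: minimize v subject to (A^T x)_j <= v for every column j,
    # with x on the probability simplex. Decision variables are (x, v).
    m, n = A.shape
    c = np.r_[np.zeros(m), 1.0]
    A_ub = np.hstack([A.T, -np.ones((n, 1))])   # A^T x - v <= 0
    A_eq = np.r_[np.ones(m), 0.0].reshape(1, -1)
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(n), A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, 1)] * m + [(None, None)])
    return res.x[:m], res.x[-1]

def max_min_value(A):
    # Max-player: maximize w subject to (A y)_i >= w for every row i.
    m, n = A.shape
    c = np.r_[np.zeros(n), -1.0]                # linprog minimizes, so flip sign
    A_ub = np.hstack([-A, np.ones((m, 1))])     # w - (A y)_i <= 0
    A_eq = np.r_[np.ones(n), 0.0].reshape(1, -1)
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(m), A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, 1)] * n + [(None, None)])
    return res.x[:n], res.x[-1]

A = np.array([[1.0, -1.0], [-1.0, 1.0]])  # matching pennies, payoff to the max-player
x_star, v = min_max_value(A)
y_star, w = max_min_value(A)
print(x_star, y_star, v, w)  # x*, y* ~ (0.5, 0.5); v == w == 0, as (1) predicts

For general stochastic games the value is defined over policies rather than mixed actions, but the saddle-point structure is the same.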


Independent Policy Gradient Methods for Competitive Reinforcement Learning

Neural Information Processing Systems

We obtain global, non-asymptotic convergence guarantees for independent learning algorithms in competitive reinforcement learning settings with two agents (i.e., zero-sum stochastic games). We consider an episodic setting where in each episode, each player independently selects a policy and observes only their own actions and rewards, along with the state. We show that if both players run policy gradient methods in tandem, their policies will converge to a min-max equilibrium of the game, as long as their learning rates follow a two-timescale rule (which is necessary). To the best of our knowledge, this constitutes the first finite-sample convergence result for independent learning in competitive RL, as prior work has largely focused on centralized/coordinated procedures for equilibrium computation.
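As a concrete illustration of the two-timescale rule, here is a minimal single-state sketch. It is our own simplification, not the paper's exact algorithm: the paper analyzes projected gradient steps on directly parameterized policies, while this uses a softmax/REINFORCE update for brevity. Each player samples an action from its own policy, observes only its own reward, and takes an unbiased policy gradient step; the slow player's step size is an order of magnitude smaller than the fast player's.

import numpy as np

rng = np.random.default_rng(0)

# Matching pennies; A[a1, a2] is the reward to player 1 (the max-player).
A = np.array([[1.0, -1.0], [-1.0, 1.0]])

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

theta_max = np.zeros(2)            # fast player's logits
theta_min = np.zeros(2)            # slow player's logits
eta_fast, eta_slow = 5e-2, 5e-3    # two-timescale rule: eta_slow << eta_fast

T = 200_000
avg_max, avg_min = np.zeros(2), np.zeros(2)
for t in range(T):
    p_max, p_min = softmax(theta_max), softmax(theta_min)
    a1 = rng.choice(2, p=p_max)
    a2 = rng.choice(2, p=p_min)
    r1 = A[a1, a2]                 # each player observes only its own reward
    r2 = -r1
    # Score-function (REINFORCE) gradient estimate: (e_a - pi) * reward
    g1 = -p_max; g1[a1] += 1.0
    g2 = -p_min; g2[a2] += 1.0
    theta_max += eta_fast * r1 * g1
    theta_min += eta_slow * r2 * g2
    avg_max += p_max / T
    avg_min += p_min / T

print(avg_max, avg_min)  # average policies approach (0.5, 0.5), the equilibrium

The two-timescale choice matters: with equal step sizes, simultaneous gradient dynamics in zero-sum games can cycle around the equilibrium rather than approach it, which is why the paper's guarantee requires this rule.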


Review for NeurIPS paper: Independent Policy Gradient Methods for Competitive Reinforcement Learning

Neural Information Processing Systems

Weaknesses: I am not convinced by the main motivation of this paper for decoupled or independent learning. Specifically, from the communication perspective, once agents can communicate the actions they each took per round, each agent can simulate any coupled algorithm locally (or any coupled online algorithm, if it has storage limitations). Since agents have to communicate with the oracle or environment in each round anyway, I don't see why communicating actions during learning would be problematic in practice. Second, this paper says that independent learning is important because it allows the algorithm to be "versatile, being applicable even in uncertain environments where the type of interaction and number of other agents are not known to the agent." I feel this description does not fit the algorithm studied in this paper and is thus a bit misleading.


Review for NeurIPS paper: Independent Policy Gradient Methods for Competitive Reinforcement Learning

Neural Information Processing Systems

The reviewers agreed that this is solid work on an important problem for which existing results are scarce. However, there were several concerns:
- The authors create some confusion in describing their method as "independent": the agents have to coordinate the learning rates ahead of time.
I believe that these concerns actually open the door for interesting follow-up work, and therefore recommend acceptance. I ask the authors to tone down the independence claims in the final version, given the concern above.

